ABSTRACT

We have developed a new COFACTOR webserver 
for automated structure-based protein function 
annotation. Starting from a structural model, given 
by either experimental determination or computa- 
tional modeling, COFACTOR first identifies 
template proteins of similar folds and functional 
sites by threading the target structure through 
three representative template libraries that have 
known protein-ligand binding interactions, Enzyme 
Commission number or Gene Ontology terms. The 
biological function insights in these three aspects 
are then deduced from the functional templates, 
the confidence of which is evaluated by a scoring 
function that combines both global and local 
structural similarities. The algorithm has been
extensively tested on large-scale benchmarks and
has demonstrated significant advantages over traditional
sequence-based methods. In the recent community-wide CASP9
experiment, COFACTOR was ranked as the best
method for protein-ligand binding site predictions.
The COFACTOR server and the template libraries are
freely available at http://zhanglab.ccmb.med.umich.edu/COFACTOR.

INTRODUCTION 

The biological function of a protein molecule is decided 
by its 3D-shape, which eventually determines how the 
molecule interacts with other molecules in living cells. 
As such, considerable efforts have been made to 
determine the structure of the protein molecules and to 
deduce the biological functions based on their 3D-shape 
(1-3). One of the most common structure-based 
approaches in protein function annotation is to detect 
homologous template proteins by global structure comparisons
and then transfer known functional annotations
from the templates (2,4,5). However, the evidence of
global structural similarity is usually insufficient for
accurate functional inference, as proteins possessing
a similar global fold can perform different biological functions.
A classic example is the α/β-barrel fold, which is adopted by both enzymatic
and non-enzymatic proteins (6). Accordingly, many contemporary
approaches have been designed to identify
local structural similarity of functionally important 
residues for drawing functional inferences (7,8). 
However, functional annotation based on local structure
alone can result in a high false-positive rate, especially
when the target protein has a low sequence identity to
the template proteins or the target structure itself
is of low resolution (3,9).

In this study, we describe a newly developed 
COFACTOR server, which combines both global and 
local structural comparison algorithms to deduce the bio- 
logical functions of proteins, starting from their 3D struc- 
ture. The output of the server includes function 
annotations in three key aspects: protein-ligand binding 
interactions, Enzyme Commission (EC) (10) and Gene 
Ontology (GO) (11). Keeping in mind that high-resolution 
experimental structures are unavailable for most of the 
protein targets in genome databases, the algorithm has 
been extensively trained for low-resolution structures 
generated from computational structure predictions. 
Meanwhile, experimental structures undoubtedly meet 
the highest structural requirement and the predic- 
tion accuracy improves using these structures. In both 
large-scale benchmark (12) and blind experiments (2), 
the COFACTOR method has demonstrated significant 
advantages over other state-of-the-art sequence- or 
structure-based comparative methods. 

MATERIALS AND METHODS 

COFACTOR algorithm 

The input to the COFACTOR server is the 3D-structure 
of a target protein, which can be obtained from 
either structure prediction or experimental determination.
Figure 1 shows a general overview of the procedure 
followed on the COFACTOR server and the analysis 
done using the server, which includes detection of struc- 
tural analogs in the PDB library and prediction of three 
different aspects of protein function, namely, EC 
numbers, GO terms and ligand binding sites. The 
structure-based function inferences are made in two 
steps, i.e. global structural alignment followed by local 
structural similarity search. 

Global structural similarity identification 

COFACTOR first identifies the template proteins of 
similar fold/topology by matching the query structure 
with all proteins in three newly developed representative 
functional libraries, which have known protein-ligand 
binding information, EC numbers and GO terms 
(J. Yang, A. Roy and Y. Zhang, submitted for 
publication). The global structure match is conducted by 
TM-align (13), a heuristic algorithm for global protein 
structure alignment, which starts from multiple seed alignments
(gapless threading, secondary structure match and
the combination of the two), followed by Needleman-Wunsch
dynamic programming refinement (14). The
objective function of the TM-align search is
TM-score (15):

$$\text{TM-score} = \max\left[\frac{1}{L}\sum_{i=1}^{L_{ali}}\frac{1}{1+\left(d_i/d_0\right)^2}\right] \qquad (1)$$

where $d_i$ is the distance between the ith pair of Cα atoms of
the query and template and $L_{ali}$ is the number of aligned
residue pairs identified by TM-align. $d_0$ is given by
$d_0 = 1.24\sqrt[3]{L-15} - 1.8$ and $L$ is the length of the query
protein. Since TM-score weights the short-distance residue 
pairs stronger than the long-distance ones, it is more sen- 
sitive to the global topology of proteins than the trad- 
itional structural similarity measurement RMSD. 
Meanwhile, because only the aligned residues are
counted in the summation, which is normalized by the
target length, the TM-score in Equation (1) accounts for both
alignment accuracy and alignment coverage in a single
parameter. Generally, a TM-score >0.5
indicates that the two proteins have the same fold, while a
TM-score <0.3 indicates random structural similarity (16).

Figure 1. Illustration of structure-based function annotation by the COFACTOR server, starting from the query structure (shown in green).
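
As a minimal sketch of the TM-score summation, the following Python snippet evaluates the score for one fixed superposition, assuming the standard form TM-score = (1/L) Σ 1/(1 + (d_i/d_0)^2) with d_0 = 1.24 (L - 15)^(1/3) - 1.8. The function names `d0` and `tm_score` are illustrative and are not part of TM-align; the maximization over superpositions performed by TM-align is not reproduced here, and very short chains are not handled.

```python
def d0(length):
    """Distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8 for a query of length L."""
    if length <= 15:
        raise ValueError("this sketch only handles query lengths above 15")
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, query_length):
    """TM-score of a single superposition.

    distances: C-alpha distances (in Angstrom) of the aligned residue pairs.
    query_length: full length L of the query, used for normalization.
    """
    scale = d0(query_length)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / query_length


# Half of a 140-residue query aligned perfectly gives a score of 0.5,
# showing how coverage and accuracy are folded into one number.
print(tm_score([0.0] * 70, 140))
```

Because the sum runs only over aligned pairs but is divided by the full query length, unaligned residues lower the score in the same way as poorly superposed ones.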

The global structural alignment between the query and 
template structures is useful for exploring fold/family 
relationships of newly solved structure or predicted struc- 
tural models. However, some folds are functionally 
diverse and in these cases, function can be accurately pre- 
dicted only by evaluating the similarity of active/binding 
site residues that are involved in the function. Moreover, 
in many cases, the functional motifs remain conserved 
during the evolution to maintain the function, even 
when the global structural similarity dwindles. Thus, a 
local sequence and structural comparison of functional
sites may provide a more reliable way of functional annotation
for the query proteins.

Accordingly, on the COFACTOR server, all template
proteins with a non-random structural similarity to the
query structure (i.e. TM-score > 0.3) (16) in each of the
three function libraries (see below) are screened further
based on their local similarity to the query structure. If fewer
than 100 non-random templates are identified in a library, up to
100 top-ranked templates are used regardless of TM-score.
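
The template pre-selection rule described above can be sketched as follows. This is only an illustration under the stated reading of the rule (keep every hit with TM-score > 0.3, and fall back to the 100 top-ranked hits when fewer than 100 pass); the function name `select_templates` and the `(template_id, tm_score)` tuple layout are hypothetical.

```python
def select_templates(hits, cutoff=0.3, min_count=100):
    """Pick templates for the local-similarity screen.

    hits: list of (template_id, tm_score) tuples from the global search.
    Returns all hits above the cutoff, or the top `min_count` hits
    regardless of score when too few hits pass the cutoff.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    non_random = [hit for hit in ranked if hit[1] > cutoff]
    if len(non_random) >= min_count:
        return non_random
    return ranked[:min_count]


hits = [("a", 0.9), ("b", 0.2), ("c", 0.5)]
print(select_templates(hits, min_count=2))  # only hits above the cutoff
print(select_templates(hits, min_count=3))  # falls back to top-ranked hits
```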

Local functional site identification 

In the second step, a heuristic algorithm
identifies the best local functional site match
between the query and template structures. As illustrated in Figure 2, a
multiple sequence alignment is first constructed and evolutionarily
conserved residues in the query sequence are
identified based on their Jensen-Shannon divergence
(JSD) score (17). The conformations of various triplet
residues from the conserved residue pool are excised 
from the query structure to construct a set of local 
3D-structural motifs. Each of the local query motifs is 
then superimposed onto the known functional site 
residues of the template protein. 
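
To make the conservation step concrete, the sketch below computes a plain Jensen-Shannon divergence between the residue distribution of one alignment column and a background distribution, and then enumerates residue triplets from the conserved positions. It is a simplified illustration: the published JSD score (17) may include further elements such as sequence weighting, and the threshold, the exhaustive triplet enumeration and all function names here are assumptions rather than the server's actual procedure.

```python
import math
from collections import Counter
from itertools import combinations


def column_distribution(column):
    """Residue frequencies of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(v * math.log2(v / mix[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def conserved_triplets(scores, threshold):
    """All index triplets drawn from positions scoring at or above threshold."""
    conserved = [i for i, s in enumerate(scores) if s >= threshold]
    return list(combinations(conserved, 3))


# Hypothetical per-position conservation scores for a five-residue query.
print(conserved_triplets([0.9, 0.1, 0.8, 0.7, 0.2], threshold=0.5))
```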

To further refine the local structural match of the func- 
tional sites, the complete structure of the query and 
template proteins are brought together in the same refer- 
ence frame, based on the rotation and translation matrices 
acquired from the initial motif superposition. A sphere of 
radius r is then defined around the geometric center of 
template motif, where r is the maximum distance of any 
template functional site residue from the geometric center. 
The residues from query and template proteins within the 
sphere are re-aligned by an iterative alignment procedure 
similar to TM-align (13), i.e. scoring matrix is repeatedly 
calculated from the current structural superposition and is 
used to generate new optimized superposition by dynamic 
programming, until convergence. The sphere thus represents
a pseudo-functional site, within which the local structural
and sequence similarity ($L_{sim}$) between query and template
proteins is evaluated by

$$L_{sim} = \frac{1}{N_t}\sum_{i=1}^{N_{ali}}\frac{1}{1+\left(d_{ii}/d_0\right)^2} + \frac{1}{N_t}\sum_{i=1}^{N_{ali}} M_{ii} \qquad (2)$$

where $N_t$ represents the total number of residues present
within the template sphere, $N_{ali}$ is the number of
query-template aligned residue pairs within the sphere,
$d_{ii}$ is the Cα distance between the ith aligned residue pair, $M_{ii}$
is the normalized BLOSUM62 substitution score for the
ith pair of residues and $d_0$ is the distance cutoff, chosen to
be 3.0 Å. The second term in Equation (2) accounts for
the evolutionary information of the functional sites. For
each binding pocket on the template, this procedure is
implemented for all the conserved query motifs and the
one with the highest $L_{sim}$ is recorded (Figure 2).
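
The local similarity score can be sketched in a few lines, assuming Equation (2) is the sum of a distance term 1/(1 + (d/d_0)^2) and a normalized BLOSUM62 term over the aligned pairs, both divided by the number of residues in the template sphere. The function name `local_similarity` is illustrative, and the substitution scores are taken as already normalized inputs because the normalization scheme is not detailed here.

```python
def local_similarity(distances, substitution_scores, n_template, d0=3.0):
    """Sketch of L_sim for one query motif against one template site.

    distances: C-alpha distances (Angstrom) of aligned pairs inside the sphere.
    substitution_scores: normalized BLOSUM62 scores of the same pairs.
    n_template: number of template residues inside the sphere (N_t).
    """
    if len(distances) != len(substitution_scores):
        raise ValueError("one substitution score is needed per aligned pair")
    structural = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    evolutionary = sum(substitution_scores)
    return (structural + evolutionary) / n_template


# Two aligned pairs out of four template residues:
# (1 + 0.5 + 1.0 + 0.0) / 4 = 0.625
print(local_similarity([0.0, 3.0], [1.0, 0.0], n_template=4))
```

Dividing by the size of the template sphere rather than by the number of aligned pairs means that a motif covering only part of the functional site is penalized.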

Functional analyses 

The COFACTOR server provides a variety of available 
annotations for the query protein using the templates, 
including EC number, GO and protein-ligand binding 
sites. We provide a brief overview of the three aspects of 
predicted functions by COFACTOR server below. 

Enzyme Commission number 

For the purpose of classifying enzymatic proteins, all 
enzyme protein structures with annotated EC number(s) 
have been collected from the PDB library (18) with the 
active site residue information mapped using Catalytic 
Site Atlas (19). As of January 2012, this compiled 
enzyme template library contains 8392 protein structures. 

The active site motifs of the template structures are im- 
portant for the local structural comparison and the query 
active site identification. For the template structures where 
the active site residues are known, the template motifs are 
defined by these annotated functional sites (19). 
Otherwise, the algorithm uses spatially clustered and evo- 
lutionarily conserved residues for generating the template 
motifs (A. Roy, S. Mukherjee, P. S. Hefty and Y. Zhang, 
submitted for publication). For the former cases, residue 
correspondences from the local alignment results are 
mapped onto the query structure, which are used for pre- 
dicting catalytic residues in the query; while in the latter, 
only predicted EC numbers are reported. 

The confidence score for EC number prediction reflects
both local and global similarities between query and
template proteins and is defined as:

$$\text{C-score}^{EC} = \frac{2}{1+e^{-\left(0.25\,L_{sim}\times SS_{BS} + \text{TM-score} + 2.5\,ID_{str}\right)}} - 1 \qquad (3)$$

where $L_{sim}$, defined in Equation (2), and TM-score, defined in
Equation (1), measure local and global similarity between
query and template enzyme, respectively, $ID_{str}$ is the sequence identity
between query and template in the structurally aligned region
of the TM-align alignment, and $SS_{BS}$ is the sequence similarity
between predicted active site residues of the query and
known active site residues of the template. Finally, the top
five scoring hits are reported.
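
As a rough numerical illustration of how the confidence score combines its inputs, the sketch below assumes that Equation (3) has the logistic form 2/(1 + e^(-x)) - 1 with x = 0.25 × L_sim × SS_BS + TM-score + 2.5 × ID_str. Both this reading of the equation and the function name `cscore_ec` are assumptions made for the example.

```python
import math


def cscore_ec(l_sim, ss_bs, tm_score, id_str):
    """EC confidence score under the assumed logistic form of Equation (3)."""
    x = 0.25 * l_sim * ss_bs + tm_score + 2.5 * id_str
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


# With non-negative inputs the score lies between 0 and 1 and grows
# with any of the four similarity measures.
print(cscore_ec(l_sim=1.2, ss_bs=0.8, tm_score=0.7, id_str=0.25))
```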

Gene ontology terms 

GO provides a widely used, machine-readable vocabulary for
automated functional annotation. To this end, a second
library of protein structures that have known GO terms
was created using PDB-GO mapping taken from the Gene
Ontology Annotation database (http://www.ebi.ac.uk/GOA/)
and the SIFTS project (http://www.ebi.ac.uk/pdbe/docs/sifts/).
This library contains 24 035 non-redundant
protein chains, associated with 13 757 unique GO terms,
as of January 2012.

Figure 2. Flowchart of functional site identification by the COFACTOR server. (i) Construction of local 3D-motifs for the query protein: conserved residues in the query sequence are identified based on the Jensen-Shannon divergence score and are then used to excise local 3D-fragments from the query structure. (ii) Identification of the functional site in the query protein: each local 3D-motif of the query is aligned with the fragments collected from the functional site of the template protein, and the local similarity between query and template protein is evaluated using $L_{sim}$ [Equation (2)]. Finally, the match with the highest $L_{sim}$ is selected. The residues of the query protein (yellow) are shown in cyan, while those of the template protein (gray) are shown in magenta.

The procedure for identifying and scoring
homologs in the GO template library is similar to that
used for EC number prediction; however, the template
motifs for the local structural comparisons are generated
using both known active and ligand-binding site residues
rather than active site residues only. Furthermore, based on
the assumption that each protein domain contributes independently
to the protein function, the GO terms
ascribed to the top five ranking hits are reconciled based
on the PIPA algorithm (20), so that the consensus prediction
identifies the intersection of functions among the top
hits and provides specific annotation to the query protein.

Protein-ligand binding sites 

Ligand binding pockets and ligand-interacting residues in 
the query protein are identified based on both global and 
local structural similarities to a comprehensive binding 
site template library, which contains 76 679 binding sites, 
including information on protein-protein, protein-nucleic
acid, protein-lipid and protein-small molecule
interactions.

The binding pose of the template ligands in the query
structure is predicted based on the superposition matrix
acquired from the local alignment of query and template
binding site residues. A quick rigid-body Metropolis
Monte Carlo simulation of the superposed ligand
then follows to improve the local geometry, where the
energy term guiding the simulation is defined as the
sum of the number of contacts made by the template ligand
with the predicted binding site residues, the reciprocal of
the number of ligand-protein clashes, and the contact
distance error, which is calculated as the difference between the
inter-atomic ligand-protein contact distance in the template
and that in the query model. Here, contacts are those interactions
that are within a distance of 0.5 Å plus the sum of
the van der Waals radii of the protein atom and the ligand atom,
while clashes are those in which the inter-atomic distance
is less than the sum of their van der Waals radii. The side
chains of ligand-binding residues are further optimized
using Scwrl4 (21).
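
The contact and clash definitions above translate directly into a distance test on each protein-ligand atom pair. The sketch below is illustrative only: the function name `classify_pair` is hypothetical, the radii in the example are placeholder values, and treating clashes and contacts as mutually exclusive categories is an assumption, since the text does not say whether a clashing pair is also counted as a contact.

```python
import math


def classify_pair(protein_xyz, ligand_xyz, r_protein, r_ligand, tolerance=0.5):
    """Label one protein-ligand atom pair as 'clash', 'contact' or 'none'.

    r_protein, r_ligand: van der Waals radii of the two atoms (Angstrom).
    tolerance: extra distance allowed beyond the radius sum for a contact.
    """
    distance = math.dist(protein_xyz, ligand_xyz)
    radius_sum = r_protein + r_ligand
    if distance < radius_sum:
        return "clash"
    if distance <= radius_sum + tolerance:
        return "contact"
    return "none"


# Placeholder radii of 1.7 and 1.5 Angstrom give a radius sum of 3.2.
for x in (3.0, 3.5, 4.0):
    print(x, classify_pair((0.0, 0.0, 0.0), (x, 0.0, 0.0), 1.7, 1.5))
```

Counting the pairs in each category over all ligand atoms yields the contact and clash numbers that enter the energy term.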
Finally, the predicted ligand conformations from all
templates are clustered based on spatial proximity
with a distance cutoff of 8 Å. If a binding pocket binds
multiple ligands (e.g. an ATP-binding pocket may also
bind MG, PO₄³⁻ and ADP), ligands within the same
pocket are clustered further based on their chemical similarity
(Tanimoto coefficient cutoff = 0.7) using the
average linkage clustering procedure to rank the predicted
binding sites.
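
For readers unfamiliar with the Tanimoto coefficient used as the chemical similarity cutoff, a minimal sketch is given below. It assumes each ligand is represented by a set of fingerprint features; the fingerprint type is not specified in the text, so the feature sets, the function name `tanimoto` and the convention for two empty sets are illustrative assumptions. The average linkage clustering itself is not reproduced.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient |A & B| / |A | B| of two feature sets."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fingerprints given as sets of feature indices.
ligand_1 = {1, 2, 3, 4, 5, 6, 7, 8}
ligand_2 = {1, 2, 3, 4, 5, 6, 7, 9}
similarity = tanimoto(ligand_1, ligand_2)
print(similarity, similarity >= 0.7)
```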

From each cluster, the protein-ligand complex with the
highest ligand-binding confidence score (C-score$^{LB}$) is
selected as the functional site prediction for
the query protein. C-score$^{LB}$ is defined as:

$$\text{C-score}^{LB} = \frac{2}{1+e^{-\left(\frac{N}{N_{tot}}\times\left(0.25\,L_{sim} + \text{TM-score} + 2.5\,ID_{str} + \frac{2}{1+\langle D\rangle}\right)\right)}} - 1 \qquad (4)$$

where $N$ is the number of template ligands in a cluster and
$N_{tot}$ is the total number of predicted ligands using the
templates. $L_{sim}$, defined in Equation (2), and TM-score,
defined in Equation (1), measure the local and global similarity
of the query to the template protein, respectively.
$ID_{str}$ is the sequence identity between the query and the
template in the structurally aligned region. $\langle D\rangle$ is the
average distance of the predicted ligand to all other predicted
ligands in the same cluster.
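
The ligand-binding confidence score can be evaluated with a short function, assuming Equation (4) reads 2/(1 + e^(-x)) - 1 with x = (N/N_tot) × (0.25 × L_sim + TM-score + 2.5 × ID_str + 2/(1 + ⟨D⟩)). The function name `cscore_lb` and the example values are hypothetical.

```python
import math


def cscore_lb(n_cluster, n_total, l_sim, tm_score, id_str, mean_distance):
    """Ligand-binding confidence score under the assumed form of Equation (4).

    n_cluster / n_total: fraction of predicted ligands falling in this cluster.
    mean_distance: average distance (Angstrom) of this ligand to the other
    predicted ligands in the same cluster.
    """
    similarity = 0.25 * l_sim + tm_score + 2.5 * id_str
    compactness = 2.0 / (1.0 + mean_distance)
    x = (n_cluster / n_total) * (similarity + compactness)
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


# A well-populated, compact cluster scores higher than a sparse one.
print(cscore_lb(8, 10, 1.0, 0.75, 0.2, 1.5))
print(cscore_lb(1, 10, 1.0, 0.75, 0.2, 1.5))
```

The cluster fraction multiplies the whole exponent, so a template with high similarity still receives a low score when few other templates place a ligand in the same pocket.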



OUTPUT 

For each submitted protein, the user will be notified by
email when the job is completed and the result data are
reported on the COFACTOR homepage. Each
COFACTOR result page consists of four main
tables (see, e.g., http://zhanglab.ccmb.med.umich.edu/COFACTOR/example/).

In the first table, structural alignments of the query with 
the top 10 template proteins ranked by TM-score, 
identified from the PDB library, are displayed using an 
interactive Jmol applet (22,23). The table provides 
details of the structural alignment as generated by 
TM-align (13), including TM-score, alignment coverage 
(fraction of residues aligned in the query), RMSD and 
the sequence identity in the structurally aligned region. 
Each of the structural alignments can be viewed inter- 
actively in the Jmol applet by clicking the corresponding 
radio buttons. The links for downloading the coordinate 
files of superposed structures are provided in the same 
table. 

The second table presents the top five enzyme templates 
ranked by confidence scores and the predicted catalytic 
residues in the query. These predicted catalytic residues 
are visually displayed using the Jmol applet in the same 
table. 

The third table lists top scoring template proteins that 
are annotated with GO terms. Usually, each template 
protein is associated with multiple GO terms that 
describe different aspects of biological and cellular functions.
As the template proteins may contain additional functional
domains, rather than simply transferring GO annotations,
the server presents the most frequently occurring GO
terms in each of the three functional aspects (molecular
function, biological process and cellular component),
which are reconciled from the top five homologs.
Hovering the mouse over each GO term displays its definition.
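
The idea of reporting the most frequently occurring GO terms among the top hits can be illustrated with a simple vote count. This is a deliberately simplified stand-in and not the PIPA reconciliation used by the server; the function name `most_frequent_terms`, the `min_votes` threshold and the example annotations are assumptions.

```python
from collections import Counter


def most_frequent_terms(template_terms, min_votes=2):
    """GO terms supported by at least `min_votes` templates, most common first.

    template_terms: one collection of GO identifiers per template hit.
    """
    votes = Counter(term for terms in template_terms for term in set(terms))
    return [term for term, count in votes.most_common() if count >= min_votes]


# Illustrative annotations for three template hits.
top_hits = [
    ["GO:0003824", "GO:0046872"],
    ["GO:0046872"],
    ["GO:0005737", "GO:0046872"],
]
print(most_frequent_terms(top_hits))
```

Terms contributed by an extra domain in a single template receive only one vote and are therefore dropped, which is the behaviour the consensus step is meant to achieve.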

The last table contains information on protein-ligand 
binding location in the query structure. Top 10 predictions 
are presented with the information on the template 
protein, the template ligand and the query residues 
which are likely to be involved in binding interactions. 
These predictions and interactions are visualized using 
the Jmol applet, where the ligand atoms are shown as 
spheres and binding site residues in query are highlighted 
using ball and stick (Figure 3). 



PERFORMANCE OF WEB SERVER 

The COFACTOR algorithm has been extensively trained 
and tested on large-scale benchmarks. In a recent study 
(12), COFACTOR was tested on 501 proteins, which 
harbor 582 natural and drug-like ligand molecules. 
Starting from the low-resolution structural models
generated by I-TASSER (24), the method successfully
identifies ligand-binding pocket locations for 65% of
apo receptors with an average distance error of 2 Å. The
average precision of binding-residue assignments is 46
and 137% higher than that of FINDSITE (4) and
ConCavity (25), respectively, both of which were designed to identify
protein-ligand binding sites.

In the recent community-wide CASP9 experiment 
where all predictions were made before the experimental 
results were released (2), COFACTOR achieved a
binding-site prediction precision of 72% and a Matthews
correlation coefficient of 0.69 for the 31 blind test proteins,
which was significantly higher than all other participating
methods. As CASP9 assessors concluded, among all 33 
participant groups 'Two groups (FN096, Zhang; FN339, 
I-TASSER_FUNCTION) performed better than the rest, 
while the following 10 prediction groups performed com- 
parably well' (2). 
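
For reference, the two evaluation measures quoted above can be computed from per-residue counts of true and false predictions using their standard definitions. The sketch below uses illustrative function names and made-up counts; it does not reproduce the CASP9 assessment data.

```python
import math


def precision(tp, fp):
    """Fraction of predicted binding residues that are true binding residues."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Made-up counts for one protein: 9 correct binding residues, 3 false
# positives, 2 missed residues and 86 correctly rejected residues.
print(precision(9, 3), matthews_cc(9, 3, 86, 2))
```

Unlike precision, the Matthews correlation coefficient also accounts for missed binding residues and for the large number of non-binding residues, which is why both values are reported.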

To examine the ability of this approach to predict two 
other unambiguously defined concepts of functions: EC 
numbers (10) and GO (11) terms, especially with new 
settings taken by the COFACTOR server, we tested the 
server approach on a large benchmark set of 450 
non-homologous proteins collected from PDB. As experimental
controls, we selected commonly used
approaches that are based on sequence-profile alignment
(26), profile-profile alignment (27) and HMM-HMM
alignment (28). In all experiments, close homologs of
query proteins were intentionally removed from the
template libraries using a sequence identity cut-off of 30%
before the predictions were made. Supplementary Figure
S1 summarizes the performance of COFACTOR in
identifying the correct function and the improvement
achieved in function prediction. For instance, if we
consider the identity of the first three digits of the EC number
as the criterion for evaluating the correctness of a prediction,
functional annotations were transferred correctly from the top
hit of COFACTOR in 156/318 enzymatic test proteins,
which is approximately 27, 9 and 12% higher than the
results obtained using the top hit by PSI-BLAST (26),
MUSTER (27) and HHsearch (28), respectively.
Similarly, after removing close homologs from the
template library, for the 337 test proteins, GO terms are
annotated correctly (Fsim > 0.5, see Supplementary
Material) by COFACTOR for 49 and 64% of proteins
using the top one and the best in top five template
proteins, respectively (Supplementary Table S1). Using
the top one (best in top five) template proteins,
PSI-BLAST, MUSTER and HHsearch predict GO
terms correctly for 38% (49%), 44% (60%) and 41%
(56%) of proteins, respectively.

Figure 3. An excerpt of the result page showing ligand-binding site analysis for a Glyoxalase family protein from Bacillus anthracis (PDB ID: 2qqz). The excerpt lists the template proteins with similar binding sites (columns: Rank, C-score LB, PDB Hit, TM-score, RMSD, IDEN, Cov., BS-score, Lig. Name, Download Complex and Predicted binding site residues in the model). The server identifies high global and local similarity to Lactoylglutathione lyase of Agrobacterium tumefaciens, suggesting that the query also has a similar metal-ion binding site, which is required for catalysis in Glyoxalase I enzymes. The protein-ligand interactions are visualized using the Jmol applet.
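
The three-digit EC criterion used in the benchmark above amounts to comparing the leading fields of two EC numbers. A minimal sketch follows; the function name `ec_match` is hypothetical, and treating an unassigned field ("-") as a mismatch is an assumption rather than a documented rule of the benchmark.

```python
def ec_match(predicted, actual, digits=3):
    """True if two EC numbers agree on their first `digits` fields."""
    predicted_fields = predicted.split(".")[:digits]
    actual_fields = actual.split(".")[:digits]
    if len(predicted_fields) < digits or "-" in predicted_fields:
        return False  # incomplete predictions are counted as incorrect here
    return predicted_fields == actual_fields


print(ec_match("4.4.1.5", "4.4.1.3"))  # same first three digits
print(ec_match("4.4.1.5", "4.1.1.5"))  # differs at the second digit
```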

Here, we should note that the PSI-BLAST, MUSTER
and HHsearch methods start only from query sequences
and are therefore much faster than the entire
sequence-to-structure-to-function pipeline in COFACTOR, since
the latter starts from the structural models predicted by
I-TASSER (although the structure search by
COFACTOR itself generally takes less than 1 h).
Nevertheless, these data suggest
that the use of protein structure information can
yield significant gains in function annotation.



CONCLUSIONS 

We have developed the COFACTOR server for auto- 
mated structure-based functional annotation. One of the 
major advantages of the COFACTOR algorithm is the 
combination of the global and local structural compari- 
sons. Although the global structural similarity is import- 
ant for functional inference, we have witnessed a number 
of examples in both the benchmark and the CASP experi- 
ments, where COFACTOR successfully identified the 
correct functional homologs, which have different global 
folds but with similar binding sites, using the local struc- 
tural comparisons (12). 



Meanwhile, since the COFACTOR scoring function
includes the global structure similarity, it is more robust
to local structural variations in the target structural
models than other methods, such as ConCavity (25),
which rely only on local pocket comparisons. This
allows the COFACTOR server to identify correct
functional homologs even when using low-resolution structure
models, which is of practical importance
given that most protein sequences lack an experimental
structure and only low-resolution structures can
be generated by computational protein structure prediction
(12).

Nevertheless, since COFACTOR is essentially a
template-based comparative method, no function predictions
can be correctly generated if there is no homologous
template protein present in the function libraries. It is
therefore critical for COFACTOR to have complete
and updated template libraries. Currently,
the structure and ligand-binding libraries are updated every
week, since the information is collected directly from the
PDB library (18). However, the GO and EC classification
data are collected from other secondary resources
(19,29,30); their updates are therefore less
regular and depend on the updates of these resources. All
the libraries are freely downloadable at
http://zhanglab.ccmb.med.umich.edu/COFACTOR/library.html. Finally,
the current algorithm is designed for single-chain
proteins. If multiple chains are submitted, the first chain
in the PDB file is used by the server automatically. We plan to
extend the algorithm to multiple-chain proteins and add
this feature to the server in the near future.



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online:
Supplementary Table 1, Supplementary Figure 1 and
Supplementary References [31-33].